This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show upfor their appointment. A number of characteristics about the patient are included in each row
Questions to ask:
- What is the probability that patients missed their scheduled appointments?
- Does patients age range has effect for absence?
- Does different neighborhoods have an effect in absence ratio?
- What other factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?
- Does the sms sent to patients has an effect to attending the appointment?
# setting up import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
sns.set_style('darkgrid')
# read data and printing out a few lines.
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
df.head()
| PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
# checking missing rows and Dtype
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110527 entries, 0 to 110526 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PatientId 110527 non-null float64 1 AppointmentID 110527 non-null int64 2 Gender 110527 non-null object 3 ScheduledDay 110527 non-null object 4 AppointmentDay 110527 non-null object 5 Age 110527 non-null int64 6 Neighbourhood 110527 non-null object 7 Scholarship 110527 non-null int64 8 Hipertension 110527 non-null int64 9 Diabetes 110527 non-null int64 10 Alcoholism 110527 non-null int64 11 Handcap 110527 non-null int64 12 SMS_received 110527 non-null int64 13 No-show 110527 non-null object dtypes: float64(1), int64(8), object(5) memory usage: 11.8+ MB
#checking for duplicates
sum(df.duplicated())
0
# checking if there is any nulls
df.isnull().sum()
PatientId 0 AppointmentID 0 Gender 0 ScheduledDay 0 AppointmentDay 0 Age 0 Neighbourhood 0 Scholarship 0 Hipertension 0 Diabetes 0 Alcoholism 0 Handcap 0 SMS_received 0 No-show 0 dtype: int64
# here we need to rename the columns and use _ as separator
df.rename(columns = { 'PatientId' : 'Patient_id', 'AppointmentID' : 'Appointment_ID', 'ScheduledDay':'Scheduled_day', 'AppointmentDay' : 'Appointment_day', 'No-show' : 'no_show'},inplace=True)
df.head()
| Patient_id | Appointment_ID | Gender | Scheduled_day | Appointment_day | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | no_show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
# here we need to make it lower letters for all columns
df.rename(columns=lambda x: x.strip().lower(), inplace=True)
df.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
# here we need to change the format of appointment day to be just date without hours
df['appointment_day'] = pd.to_datetime(df['appointment_day'],format='%Y-%m-%dT%H:%M:%SZ')
df.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
# here we add column week_day to see it it have a relation with no_show
df['week_day'] = df['appointment_day'].dt.day_name()
df.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | week_day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Friday |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | Friday |
# here we checking if there is a not logic values
df.describe()
| patient_id | appointment_id | age | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
# here we find row with -1 value in age
min_age = df['age'].min()
min_age
-1
df[df['age'] == min_age]
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | week_day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99832 | 4.659432e+14 | 5775010 | F | 2016-06-06T08:58:13Z | 2016-06-06 | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No | Monday |
# here we drop row with -1 value in age
df = df.drop(99832)
# here we find rows with 0 value in age
mini_age = df['age'].min()
mini_age
0
df1 = df[df['age'] == mini_age]
# here we drop rows with 0 value in age
df.drop(df1.index, inplace=True)
# checking again age column
df['age'].describe()
count 106987.000000 mean 38.316085 std 22.466214 min 1.000000 25% 19.000000 50% 38.000000 75% 56.000000 max 115.000000 Name: age, dtype: float64
# adding age range column to include age category
age_edges = [0,10,18,50,115]
age_names = ['kids', 'teen', 'adult', 'senior']
df['age_range'] = pd.cut(df['age'], age_edges, labels=age_names)
df.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | week_day | age_range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Friday | senior |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | senior |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | senior |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | kids |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | Friday | senior |
# adding counter for each row to calculate for reference and use in exploratory
df.loc[:, 'counter'] = 1
df.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | week_day | age_range | counter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Friday | senior | 1 |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | senior | 1 |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | senior | 1 |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | kids | 1 |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | Friday | senior | 1 |
df_noshow = df.query('no_show == "Yes"')
fig = px.pie(df, values='counter', names='no_show', title='Percentages of people didnt show')
fig.update_traces(textposition='inside', textinfo='percent+label+value')
fig.show()
fig = px.pie(df_noshow, values='counter', names='age_range', title='Percentages of different age_range that no show')
fig.update_traces(textposition='inside', textinfo='percent+label+value')
fig.show()
fig = px.pie(df_noshow, values='counter', names='neighbourhood', title='Percentages of different Hospital location that no show')
fig.update_traces(textposition='inside', textinfo='percent+label+value')
fig.show()
fig = px.pie(df_noshow, values='counter', names='week_day', title='Percentages of different week days no show')
fig.update_traces(textposition='inside', textinfo='percent+label+value')
fig.show()
fig = px.pie(df_noshow, values='counter', names='gender', title='Percentages of different gender that now show')
fig.update_traces(textposition='inside', textinfo='percent+label+value')
fig.show()
# Exploring data by hist
df_noshow[['age','scholarship','handcap']].hist(figsize=(10,7));
# exploring data with scatter pairplot
sns.pairplot(df_noshow, height=2.5);
no_show_counts = df.groupby(['sms_received', 'no_show']).count()['counter']
no_show_counts
sms_received no_show
0 No 60290
Yes 12112
1 No 25017
Yes 9568
Name: counter, dtype: int64
no_show_totals = df.groupby('sms_received').count()['counter']
no_show_totals
sms_received 0 72402 1 34585 Name: counter, dtype: int64
sms_yes_proportions = no_show_counts[1] / no_show_totals[1]
sms_no_proportions = no_show_counts[0] / no_show_totals[0]
sms_yes_proportions
no_show No 0.723348 Yes 0.276652 Name: counter, dtype: float64
sms_no_proportions
no_show No 0.832712 Yes 0.167288 Name: counter, dtype: float64
ind1 = np.arange(len(sms_yes_proportions))
width = 0.35
# plot bars
yes_bars = plt.bar(ind1, sms_yes_proportions, width, color='r', alpha=1, label='sent')
no_bars = plt.bar(ind1 + width, sms_no_proportions, width, color='b', alpha=1, label='not sent')
# title and labels
plt.ylabel('Proportion')
plt.xlabel('show')
plt.title('Proportion by sms received and no show')
locations = ind1 + width / 2 # xtick locations
labels = ['No', 'Yes'] # xtick labels
plt.xticks(locations, labels)
# legend
plt.legend();
- Statistics showing that 20.3% patients don't show up after schedule medical appointment and 79.7% of the patients show up .
- study shows that patients that not show in thier appointments from the adults with percentage 47.3%.
- it is obvious that neighbourhood have a relation to no show as it can be hard to reach the location of the hopital or hard transportation.
- the study shows also that most of the patients that don't show up after schedule medical appointment are females with percentage 65.8% .
- logically we can think that when we send sms to patients that it will negatively effect the no show ratio but the study shows that it have no relation as when hospitals send sms's absence ratio increased.